Reconstructing the Diachronic Morphology of Romanian from Dictionary Citations

Dan Cristea¹,², Radu Simionescu¹, Gabriela Haja³
¹ Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iași
² Institute for Computer Science, Romanian Academy, the Iași branch
³ “Alexandru Philippide” Institute of Philology, Romanian Academy, the Iași branch
E-mail: radu.simionescu@info.uaic.ro, dcristea@info.uaic.ro, gabihaja@yahoo.com

Abstract

This work represents a first step in the direction of reconstructing a diachronic morphology for Romanian. The main resource used in this task is the digital version of the Romanian Language Thesaurus Dictionary (eDTLR). This resource offers various usage examples for its entries, citations extracted from old and modern Romanian texts. The concept of “word deformation” is introduced and classified into several categories. The research conducted aims at detecting one type of such deformations occurring in the citations: changes only in the root of the old word forms, without migration to another paradigm. An algorithm is presented which automatically infers old root forms, and which is based on a paradigmatic data model of the current Romanian morphology. Having the inferred roots and the paradigms that they are part of, old flexion forms of the words can be deduced. Moreover, by exploiting the chronology of the citations, the inferred old word forms can be framed in certain periods of time, finally configuring an important linguistic resource for researchers interested in the evolution of the Romanian language.

Keywords: morphology, diachronic, language evolution

1 Morphological sources for Romanian language

eDTLR¹ (Cristea et al., 2007) is the digital version of the Romanian Language Dictionary (DLR), edited by the Romanian Academy between 1906 and 2010. Apart from XML representations of the entries, eDTLR also includes part of the sources that have been used to build the corpus of citations, in digital form, and the software to access them. The Dictionary basically describes, following the lexicographic norms of the Academy, all words registered in documents and texts (from Scrisoarea lui Neacșu/Letter of Neacșu, 1521, the first known text in Romanian, until today). It includes etymology, and each word sense is illustrated by quotations from a large collection of texts, attributed to all social and cultural domains (2500 titles and approx. 3000 volumes).

The morphological variation in the evolution of Romanian is mirrored in the rich collection of citations that eDTLR includes (more than 1.3 million). Words with many senses can span tens of pages in the original paper dictionary (for instance, 100 pages for a verb like a veni/to come). Moreover, the citations cover all historical periods in the evolution of the written and spoken Romanian language (Rosetti et al., 1968; Gheție, 1977; Gheție and Chivu, 2000), which makes them extremely valuable as a source of data in the attempt to reconstruct a diachronic morphology. Each citation includes exactly one occurrence of the title word. Moreover, citations are paired with codes uniquely identifying the source document and the pages from which they have been extracted.

An external database, called chronology, has been compiled as pairs code-year or code-interval, where the year/interval are publishing dates of the source. As such, a certain morphological form of the title word can be precisely located in time.

AnaMorph (Cristea, Forăscu, 2006) is a paradigmatic word flexing instrument for Romanian. It sees a word as a lexical unit made up of two morphemes, a root and an ending. In its morphological variations, a word can have several roots, as given by the irregularities in declension or conjugation. There are mainly two causes of these irregularities: inheritance of old forms and phonetic alternations. The number of roots, the complete set of endings and the association of different roots with endings in flexing make up a paradigm (Tufiș, 1989). Usually, a paradigm is shared by a class of words having the same part of speech. In AnaMorph, the paradigms have been defined manually, following a grammar of modern Romanian. As such, 366 paradigms, which include 150 sets of endings, completely cover the morphology of nouns, verbs and adjectives of contemporary Romanian, as given by DEX (in its online version²).

¹ Built between 2007 and 2010, in a project financed by the Romanian Government and coordinated by UAIC-FII (https://consilr.info.uaic.ro/edtlr/wiki/index.php?title=Digitalizing_the_Thesaurus_Dictionary_of_the_Romanian_Language).
² www.dexonline.ro

2 Going back in time

If we compare the language spoken or written today with that of the first quarter of the previous century, we find fewer differences than between today's Romanian and that of the middle of the 19th century. The further we go back in the past, the bigger the differences are. This can also be taken in the sense that we expect to find more common word forms between today's Romanian language and the one spoken 75 years ago than between today's Romanian and the one spoken 160 years ago. Moreover, changes are not abrupt, affecting the whole vocabulary at once, but merely involve the class of words belonging to the same paradigm and sometimes only isolated words. Mainly, at one moment in time or over a certain interval, one paradigm gradually changed. Very rarely, abrupt changes may also occur, in which case they are mainly issued by rules imposed by the Romanian Academy, which were gradually adopted by society³.

The research presented in this paper aims at inferring old forms and associating them with certain periods of time, based on the set of examples contained in eDTLR and the use of the chronology. Then, the timing associated with language changes is used to bring to light phenomena related to the evolution of the Romanian language.

Since citations are paired with years/intervals, this task seems straightforward. Still, two things complicate it considerably: the recognition of the morphological tags of the occurrences of the title word in the citations, and the fact that several forms could have been in use at the same moment or over the same period.

We have detected four ways in which a word can change its paradigm over time:
• the word underwent changes in one or more of its roots;
• the word migrated to another paradigm;
• the word is a noun and changed its grammatical gender;
• a combination of the above deformations.

This research deals only with detecting and inferring the forms which underwent a root change. For our study, we have taken into consideration only the forms which are not present in the morphologic dictionary of the current Romanian language and can be obtained in conjugation or declension from a known lemma (for which a paradigm is known).

In the present study we have considered only nouns, adjectives and verbs (the three categories with the richest morphology in Romanian).

³ Romania being a rather conservative and stubborn society, the rules imposed by the Romanian Academy, the only forum that has the right to impose changes in the official orthography, are sometimes not obeyed by everyone. For instance, the 1993 orthography regulations divided society into two camps: those who accept the use of â in the inner position and those who insist on keeping the old written form î (among other details). At least, the language taught in schools always conforms to the academic regulations.

3 The algorithm

In the following, we refer to a word as being “known” if it is present in the morphological dictionary of the current Romanian language. The occurrence of the title word in the citations is detected by imposing a one-occurrence-of-title-word-per-citation restriction and making use of a variation of the Levenshtein distance.

Given a known title word (a lemma) l, framed under the modern paradigm p, and an unknown flexion form f which is extracted from a citation that corresponds to l, assume that f is an old form of a root deformation of l. Next, verify the assumption made: determine if f can be framed under p and, if so, infer a root and its flexion forms. Solving such a problem is required when, given a title word and an old flexion form extracted from one of its citations, we want to establish if this old form is a root change: it is part of the same paradigm as the title word, but there is a change in the root.

We define s(p) as the list of suffixes indicated by the paradigm p.

To determine if f can be framed under the paradigm p, for every suffix s(p)[k] that matches f at the end we assume that f might result from a deformed root plus the suffix s(p)[k]. By trimming each such matching suffix from the form, we create a set of candidate roots R(f,p). This set is used in the context of a given title word.

Next, the validation phase follows. For each candidate root R(f,p)[i], generate a list of fictive flexion forms F(R(f,p)[i], p) by attaching the suffixes imposed by p to the candidate root R(f,p)[i]. Define the score of R(f,p)[i] as the number of automatically generated (fictive) flexion forms in F(R(f,p)[i], p) which are detected in any of the eDTLR citations or in the sections dedicated to morphological specifications. If none of the candidates has a score higher than 0, conclude that f cannot be framed under paradigm p. Otherwise, conclude that the root having the best score, R(f,p)[j], is an old deformed root. The forms F(R(f,p)[j], p) can now be inferred and morphologically classified thanks to the data model of the paradigms, which associates a part of speech with each suffix that they contain.

Since the chronology of the citation can be mapped to all words belonging to it, the forms inferred by the root changing algorithm, once detected in some citation, become automatically attributed to the time/period of the citation.

Some details of the algorithm, hidden in the short presentation above, can best be understood by following the examples below. But first, a short description is given of the paragraphs which specify morphological variations for words.

When a fictive form is searched for in eDTLR, it is looked up in the citations and also in the paragraphs which specify morphological variations for words. Such paragraphs have a somewhat standardized format. Below is a sample for the word dator (the formatting was kept exactly as in the source):

-óare, dătór, -oáre, (învechit) datóriu, -oáre, datúr, -úrie, deatóriu, -ie, dătóriu, -ie, (regional) deatór, -oáre (ALR SN I h 1 006/95), ditór, -oáre adj., subst. - Lat. debitorius, -a, -um (după da4)

To look up a fictive form in such a source, the words that it contains must be extracted first. For many word forms, this format specifies only their ending. A parser was built to extract complete words from it. It was not a trivial task: this format was not developed for computer parsing, but for the average human reader. For a native Romanian speaker it comes naturally to understand the flexion form of a word, given a suffix. But from a computational point of view, obtaining complete word forms from this format is quite a delicate problem. The suffix must be attached to the end of the first complete form to its left; -urie, for instance, must be attached to datur. This attachment has its own rules. Because the ending ur of datur matches the beginning of urie, only ie must be attached to datur to form the complete word daturie. And, of course, there are exceptions to the rules. For example, the correct way to attach -oare to deator is deatoare. Such exceptions were implemented using a map of common endings which require special handling.

Example 1: title word dansa (verb to dance)

The paradigm this word is part of accepts the following suffixes: dans-{a am ai ați au asem aseși ase aserăm aserăți aseră ează ez ezi ează ăm ați eze ași ă arăm arăți ară ând ându at ată ați ate}. For example, dansa has the root dans and the suffix a, which is associated with the infinitive form (homonymous with the past simple third person singular for this particular paradigm).

The word dănțată (past participle) has been found in a citation under the dansa title word. In this case, two suffixes match, ă and ată, so there are two candidate roots: dănțat and dănț.

The validation of the candidates goes as follows:
• The dănțat root generates the forms: dănțata dănțatam dănțat dănțatai dănțataserăm dănțatezi etc. In counting the occurrences of the generated forms in the dictionary we, of course, do not include the original form (in our case dănțată). No match has been found, so the score was 0;
• Out of the dănț root, the generated forms are: dănța dănțam dănțau dănțând dănțaserăm etc. Leaving out the occurrences of the original form, two of the generated forms are found in eDTLR, which gives a score of 2 (these forms are dănțat and dănțând).

Since one root yielded a score higher than 0, the root dănț is considered a deformed root and inserted in the diachronic morphologic dictionary under the same paradigm as dansa, so all its flexion forms will also be generated. Out of the three occurrences of the forms dănțată, dănțat and dănțând, only one belongs to a citation (the original form dănțată), the other two being examples of form variation, and the chronology indicates the year 1854 for this form.

Example 2: title word dator (adjective indebted)

The word from our previous example, dansa, has only one root for all its modern and old flexion forms. In this example we illustrate an adjectival paradigm which accepts two roots, each with its own suffixes.

For the title word dator, the occurrence deatori (masculine plural with no definiteness) has been found in one of its citations. The paradigm for dator accepts two groups of suffixes, each one in combination with a different root:
• dator-{Ø ul ului i ii ilor}
• datoar-{e ea ei ele elor}
We denote by Ø the empty ending.

The matching of the deatori form against the endings in the two groups succeeds only in the first group, in positions 1 and 4. The matches deatori-Ø and deator-i trigger, respectively, two candidate roots: deatori and deator. However, the paradigm of the modern word dator requires that all endings of the first group be combined with the same root. As said, we oblige all virtual forms to stick to the restrictions of the modern paradigm. Therefore, for the deatori candidate root, the virtual forms generated out of the endings in the first group are: {deatori, deatoriul, deatoriului, deatorii, deatoriii, deatoriilor}. Out of these, deatorii is found once in the dictionary. Secondly, for the deator candidate root, the corresponding virtual forms would be: {deator, deatorul, deatorului, deatori, deatorii, deatorilor}. This time two different forms are found: deatorii and deator. For the reasons explained, forms like, for instance, deatori-elor will not be searched for when validating.

In this case both candidates yielded scores greater than 0; still, the one with the best score is considered the actual root of the deformed version of this word, and that is deator, which is in fact correct. Moreover, let us notice that the only form among those generated from the root deatori which had one occurrence in the dictionary is also among the forms derived from the second root, deator, and this ensures that the solution is safe.

But what would have happened if the form deator had not been found in the dictionary? This would be a case of equal scores, which is resolved in favour of the shorter root. This heuristic seems to guess the correct root in most cases. Of course, when a root is inferred incorrectly, a set of incorrect old, deformed flexion forms is expected to be inferred as well.

Example 3: title word deschide (verb open)

This example is a more complex one, for which the shorter root heuristic happens to be applied. The flexion form dășchise was found in a citation.

The suffixes which are accepted by the paradigm of deschide, grouped by different roots, are:
• deschid-{e eam eai ea eați eau eți Ø em ă};
• deschi-{sesem seseși sese seserăm seserăți seseră sei seși se serăm serăți seră s să și};
• deschiz-{i ând ându}

There are three endings which match at the end of dășchise: Ø, e and se. They belong to two different groups of suffixes. They generate the following candidate roots, with the corresponding virtual forms (which will be looked up in the entire eDTLR):
1. dășchise (trimmed Ø) with the virtual forms: dășchisee dășchiseeam dășchise dășchiseem dășchiseă
2. dășchis (trimmed -e) with the virtual forms: dășchise dășchiseam dășchis dășchisem dășchisă
3. dășchi (trimmed -se) with the virtual forms: dășchisesem dășchiseseși dășchis dășchisă dășchiși

The first root yielded a score of 0, because none of the virtual forms were found anywhere in eDTLR. For the second root, dășchis-Ø and dășchis-ă were found (the representation -ă is used to illustrate the manner in which these virtual forms were generated), resulting in a score of 2. The third root also yielded a score of 2, because the virtual forms dășchi-s and dășchi-să were found. The forms which determined the score for the second and third candidate roots are the same. It is an unfortunate coincidence that such a situation occurred. Not only does the paradigm of deschide permit this, but the citations and morphologic variation paragraphs from eDTLR used to validate this case also contained only the two forms dășchis and dășchisă. Had there been one more such old form present in eDTLR, it might have led to a clearer result. Still, the shorter root heuristic applies, resulting in the root dășchi being considered the correct one, which is in fact true.

4 Results

The morphologic dictionary of the current Romanian language, which is used for determining if a word is “known” or not, contains a total of 1.15 million forms, corresponding to approx. 145,000 distinct lemmas.

The algorithm described above was applied to 41,911 entries (the letters D, P, S, V) out of the approximately 175,000 title words that the whole dictionary contains. For these entries the dictionary includes 205,654 citations. We have found a total of 14,782 unknown flexion forms which have a known lemma. Out of them we inferred a total of 22,697 new flexion forms, by using 7,295 forms that were found in the entries as pilot forms (citations of the morphological specification paragraph). In total, we have morphologically classified 29,870 old, unknown words. The total number of new roots inferred was 2,705, for 1,938 known lemmas.

The algorithm relies on the shortest root heuristic in only 35 cases, which means 1.29% of the total cases.

We manually evaluated all the inferred roots for the nouns and adjectives only. The correctors were 20 master students in Computational Linguistics. Each student received a packet which contained random entries, where an entry is a word with one of its roots automatically inferred by the presented algorithm. The other roots were unknown. The correctors' job was to identify the roots which were inferred erroneously and also to type in the unknown roots.

Each entry was randomly distributed to 2 correctors. After the first phase of the correction, the contradictions between the packets were revealed to the students. In the next phase they discussed and negotiated the correctness of their choices, and the number of contradictions decreased. At the end there were still some contradictions left. By counting the entries which were in the end considered correct by both students, we got a total of 2,064 correctly inferred roots, out of the total of 2,120 entries. This represents a percentage of 97.36% for the case of nouns.

We have chosen to leave the verbs out of this evaluation/correction student project, because the task was considered too difficult and too error-prone for subjects without proper linguistic training.

The total number of new roots manually typed in by the students and which proved to be consistent across different correction packets was 550. The number of roots which did conflict was, however, 181. The small number of roots inserted manually is explained by the fact that only about a third (36%) of the nouns and adjectives contained more than one root.

The high conflict ratio (24.76%) is explained by the fact that we were very much constrained by time in the second part (the confrontation part), when the negotiations between correctors had to happen. About half of the students did not manage to contribute to this second part at all.

Guessing the root of an old word is a tricky process and requires extensive knowledge of the history of the language, given that the data provided for correction/completion contains words which were written 400 years ago.

5 Conclusions

Determining the forms the words had over time, anchored in the deformation of roots and the paradigmatic morphology, is the first step in inferring general rules of evolution of the Romanian language. Out of this study we aim to reconstruct the general trends that governed the evolution of the Romanian language. The following steps would be dedicated to the investigation of the other cases of variation of word paradigms, mentioned in the first section. After precisely defining the paradigms associated with each title word and the interval of time each paradigm had been in use, we intend to build chronological records of each title word, by arranging their paradigms on the time axis. Then we will correlate these chronological records in search of patterns of variation, with the intent to infer the rules that govern the language evolution. Various resources will be built in the process, which could be used for creating fascinating tools, such as a diachronic part-of-speech tagger, or a tool which would automatically predict the interval in which a text has been written.

6 Acknowledgements

The research conducted in this article was partially supported by the ICT-PSP projects MetaNet4U and Atlas.

7 Bibliography

Cristea, D., Forăscu, C. (2006). Linguistic Resources and Technologies for Romanian Language. In Journal of Computer Science of Moldova, Academy of Science of Moldova, Institute of Mathematics and Computer Science, vol. 14, nr. 1(40), pp. 34-73, ISSN 1561-4042.
Cristea, D., Răschip, M., Forăscu, C., Haja, G., Florescu, C., Aldea, B., Dănilă, E. (2007). The Digital Form of the Thesaurus Dictionary of the Romanian Language. In Proceedings of SpeD 2007, Speech Technology and Human-Computer Dialogue, Iași, May 10-12, 2007.
Gheție, I. (coord.) (1977). Istoria limbii române literare. Epoca veche (1532-1780). București: Editura Academiei Române.
Gheție, I., Chivu, G. (coord.) (2000). Contribuții la istoria limbii române literare. Secolul al XVIII-lea (1688-1780). București: Editura Academiei Române.
Rosetti, A. et al. (1968-1973). Istoria literaturii române. București: Editura Academiei Republicii Socialiste România.
Tufiș, D. (1989). It Would Be Much Easier If WENT Were GOED. In Proceedings of the 4th European Conference of the Association for Computational Linguistics, Manchester.
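
As an illustration of the root inference procedure described in Section 3 (candidate roots obtained by trimming matching endings, scoring by the number of attested fictive forms, ties resolved in favour of the shorter root), the following is a minimal sketch and not the implementation used for the results above. The data model (a paradigm as a list of ending groups), the function names and the small sets of attested forms are assumptions made for this example; the forms themselves are taken from Examples 2 and 3, and the empty string stands for the empty ending Ø.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical data model: a paradigm is a list of ending groups, and all
# endings of one group must combine with the same root ("" is the empty ending).
Paradigm = List[List[str]]


def candidate_roots(form: str, paradigm: Paradigm) -> List[Tuple[str, int]]:
    """Trim every ending that matches the end of `form`; keep the group index."""
    candidates: List[Tuple[str, int]] = []
    for group_index, endings in enumerate(paradigm):
        for ending in endings:
            if form.endswith(ending) and len(form) > len(ending):
                root = form[: len(form) - len(ending)]
                if (root, group_index) not in candidates:
                    candidates.append((root, group_index))
    return candidates


def score(root: str, group_index: int, form: str,
          paradigm: Paradigm, attested: Set[str]) -> int:
    """Count distinct fictive forms, other than `form` itself, that are attested."""
    fictive = {root + ending for ending in paradigm[group_index]}
    fictive.discard(form)  # the original form does not count as evidence
    return len(fictive & attested)


def infer_root(form: str, paradigm: Paradigm, attested: Set[str]) -> Optional[str]:
    """Return the best-scoring candidate root, or None if every score is 0.

    Equal scores are resolved in favour of the shorter root.
    """
    best: Optional[Tuple[int, int, str]] = None
    for root, group_index in candidate_roots(form, paradigm):
        s = score(root, group_index, form, paradigm, attested)
        key = (s, -len(root), root)
        if s > 0 and (best is None or key > best):
            best = key
    return None if best is None else best[2]


# Example 2: dator, with a toy stand-in for the forms found in eDTLR.
DATOR: Paradigm = [["", "ul", "ului", "i", "ii", "ilor"],
                   ["e", "ea", "ei", "ele", "elor"]]
ATTESTED_DATOR = {"deatori", "deatorii", "deator"}
result2 = infer_root("deatori", DATOR, ATTESTED_DATOR)
print(result2)  # deator

# Example 3: deschide, where two candidates tie and the shorter root wins.
DESCHIDE: Paradigm = [
    ["e", "eam", "eai", "ea", "eați", "eau", "eți", "", "em", "ă"],
    ["sesem", "seseși", "sese", "seserăm", "seserăți", "seseră", "sei",
     "seși", "se", "serăm", "serăți", "seră", "s", "să", "și"],
    ["i", "ând", "ându"],
]
ATTESTED_DESCHIDE = {"dășchise", "dășchis", "dășchisă"}
result3 = infer_root("dășchise", DESCHIDE, ATTESTED_DESCHIDE)
print(result3)  # dășchi
```

In this sketch the fictive forms of a candidate are generated only from the ending group in which the match occurred, mirroring the restriction in Example 2 that forms such as deatori-elor are not searched for. A real lookup would replace the toy sets with a search over the eDTLR citations and the morphological specification paragraphs.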